We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more.

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
Quote: OpenAI @OpenAI
Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this